Abstract

We  describe  an  implementation  of  a  simple  probabilistic  link  grammar.  This  probabilistic 
language model extends trigrams by allowing a word to be predicted not only from the
two immediately preceding words, but potentially from any preceding pair of adjacent
words  that  lie  within  the  same  sentence.  In  this  way,  the  trigram  model  can  skip  over  less 
informative  words  to  make  its  predictions.  The  underlying  “grammar”  is  nothing  more 
than  a  list  of  pairs  of  words  that  can  be  linked  together  with  one  or  more  intervening 
words;  this  word-pair  grammar  is  automatically  inferred  from  a  corpus  of  training  text. 
We  present  a  novel  technique  for  indexing  the  model  parameters  that  allows  us  to  avoid 
all  sorting  in  the  M-step  of  the  training  algorithm.  This  results  in  significant  savings  in 
computation  time,  and  is  applicable  to  the  training  of  a  general  probabilistic  link  grammar. 
Results  of  preliminary  experiments  carried  out  for  this  class  of  models  are  presented. 
1  Introduction


The  most  widely  used  statistical  model  of  language  is  the  so-called  trigram  model.  In  this 
simple  model,  a  word  is  predicted  based  solely  upon  the  two  words  which  immediately 
precede  it.  The  simplicity  of  the  trigram  model  is  simultaneously  its  greatest  strength 
and  weakness.  Its  strength  comes  from  the  fact  that  one  can  easily  estimate  trigram 
statistics  by  counting  over  hundreds  of  millions  of  words  of  data.  Since  implementation 
of  the  model  involves  only  table  lookup,  it  is  computationally  efficient,  and  can  be  used 
in  real-time  systems.  Yet  the  trigram  model  captures  the  statistical  relations  between 
words  by  the  sheer  force  of  numbers.  It  ignores  the  rich  syntactic  and  semantic  structure 
which  constrains  natural  languages,  allowing  them  to  be  easily  processed  and  understood 
by  humans. 

Probabilistic  link  grammar  has  been  proposed  as  an  approach  which  preserves  the 
strengths and computational advantages of trigrams, while incorporating long-range
dependencies and more complex information into a statistical model [LST92]. In this paper
we describe an implementation of a very simple probabilistic link grammar. This
probabilistic model extends trigrams by allowing a word to be predicted not only from the
two  immediately  preceding  words,  but  potentially  from  any  preceding  pair  of  adjacent 
words  that  lie  within  the  same  sentence.  In  this  way,  the  trigram  model  can  skip  over  less 
informative  words  to  make  its  predictions.  The  underlying  “grammar”  is  nothing  more 
than  a  list  of  pairs  of  words  that  can  be  linked  together  with  one  or  more  intervening 
words  between  them.  This  paper  presents  an  outline  of  the  basic  ideas  and  methods  used 
in  building  this  model. 

Section  2  gives  an  introduction  to  the  long-range  trigram  model  and  explains  how  it  can 
be  seen  as  a  probabilistic  link  grammar.  The  word-pair  grammar  is  automatically  inferred 
from  a  corpus  of  training  text.  While  mutual  information  can  be  used  as  a  heuristic  to 
determine  which  words  might  be  profitably  linked  together,  this  measure  alone  is  not 
adequate.  In  Section  3  we  present  a  technique  that  extends  mutual  information  to  suit 
our  needs.  The  parameter  estimation  algorithms,  which  derive  from  the  EM  algorithm, 
are  presented  in  Sections  4  and  5.  In  particular,  we  present  a  novel  technique  for  indexing 
the  model  parameters  that  allows  us  to  avoid  all  sorting  in  the  M-step  of  the  training 
algorithm.  This  results  in  significant  savings  in  computation  time,  and  is  applicable  to 
the  training  of  a  general  probabilistic  link  grammar.  In  Section  6  we  present  the  results 
of  preliminary  experiments  carried  out  using  this  approach. 


2  A  Long-Range  Trigram  Model 

As  a  motivating  example,  consider  the  picture  shown  below.  This  diagram  represents 
a  linkage  of  the  underlying  sentence  “Either  a  rioja  ...  suckling  pig”,  as  described  in 
[ST91].  The  important  characteristics  of  a  linkage  are  that  the  arcs,  or  links,  connecting 
the  various  words  do  not  cross,  that  there  is  no  more  than  one  link  between  any  pair  of 
words,  and  that  the  resulting  graph  is  connected.  Viewed  probabilistically,  we  imagine 
that  each  word  is  generated  from  the  bigram  ending  with  the  word  that  it  is  linked  to 
on  the  left.  Thus,  the  first  right  parenthesis  is  generated  from  the  bigram  (rioja,  “(”) 
while  the  word  “suckling”  is  generated  from  the  bigram  (roast, young).  The  word  “or” 
is generated from the bigram ($\perp$, Either), where $\perp$ is a boundary word. Another valid
linkage would connect the first left parenthesis with the last right parenthesis, but this
would  preclude  a  connection  between  the  words  “Either”  and  “or”  since  the  resulting 
links  would  cross. 


Either  a  rioja  (  La  Rioja  Alta  ’85  )  or  a  burgundy  (  La  Tache  ’83  ) 


goes  with  Botin’s  roast  young  suckling  pig 

To  describe  the  model  in  more  detail,  consider  the  following  description  of  standard 
trigrams.  This  model  is  viewed  as  a  simple  finite-state  machine  for  generating  sentences. 
The states in the machine are indexed by pairs of words. Adjoining the boundary word
$\perp$ to our vocabulary, we suppose that the machine begins in the state $(\perp, \perp)$. When
the machine is in any given state $(w_1, w_2)$ it progresses to state $(w_2, w_3)$ with probability
$T(w_3 \mid w_1\,w_2)$ and halts with probability $T(\perp \mid w_1\,w_2)$, thus ending the sentence.
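
The standard trigram machine just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table layout (a dictionary from states to next-word distributions), the `<s>` stand-in for the boundary word, and the toy probabilities are all hypothetical and do not come from the paper.

```python
import random

BOUNDARY = "<s>"  # hypothetical stand-in for the boundary word


def generate(T, rng):
    """Run the trigram machine once; T maps a state (w1, w2) to {w3: probability}."""
    state = (BOUNDARY, BOUNDARY)
    sentence = []
    while True:
        words, probs = zip(*T[state].items())
        w = rng.choices(words, weights=probs)[0]
        if w == BOUNDARY:       # drawing the boundary word halts the machine
            return sentence
        sentence.append(w)
        state = (state[1], w)   # (w1, w2) -> (w2, w3)


# Toy table (an assumption) in which every distribution is deterministic.
T = {
    (BOUNDARY, BOUNDARY): {"roast": 1.0},
    (BOUNDARY, "roast"): {"pig": 1.0},
    ("roast", "pig"): {BOUNDARY: 1.0},
}
sentence = generate(T, random.Random(0))
```

With a realistic table the same loop samples sentences from the trigram model; the deterministic toy table is used only so that the behavior is easy to follow.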

Our  extended  trigram  model  can  be  described  in  a  similar  fashion.  Again  states  are 
indexed by pairs of words, but a state $s = (w_1, w_2)$ can now either halt, step, or branch,
with probability $D(\text{HALT} \mid s)$, $D(\text{STEP} \mid s)$, and $D(\text{BRANCH} \mid s)$ respectively. In case either
a STEP or a BRANCH is chosen, the next word $w$ is generated with the trigram probability
$T(w \mid w_1\,w_2)$. But in the case that BRANCH was chosen for the state $s$, an additional word
$w'$ is generated from the long-range trigram distribution $L(w' \mid w_1\,w_2)$.

For example, in generating the above linkage, the state indexed by $s$ = (or, a) chooses
to step, with probability $D(\text{STEP} \mid s)$, and the word “burgundy” is then generated with
probability $T(\text{burgundy} \mid \text{or a})$. On the other hand, the state $s = (\perp, \text{Either})$ branches,
with probability $D(\text{BRANCH} \mid s)$, and from this state the words “a” and “or” are
then generated with probabilities $T(\text{a} \mid \perp \text{Either})$ and $L(\text{or} \mid \perp \text{Either})$ respectively.

This  results  in  linkages,  such  as  the  one  shown  above,  where  every  word  is  linked  to 
exactly  one  word  to  its  left,  and  to  zero,  one,  or  two  words  on  its  right.  If  we  number 
the words in the sentence $S$ from 1 to $|S|$, then it is convenient to denote by $\triangleleft i$ the index
of the word which generates the $i$-th word of the sentence. That is, word $i$ is linked to
word $\triangleleft i$ on its left. For instance, in the linkage shown above we see that $\triangleleft 8 = 7$, $\triangleleft 9 = 4$,
and $\triangleleft 10 = 1$. This notation allows us to write down the probability of a sentence as
$P(S) = \sum_{\Lambda \in \mathcal{L}(S)} P(S, \Lambda)$, where $\mathcal{L}(S)$ is the set of all linkages of $S$, and where the joint
probability $P(S, \Lambda)$ of a sentence and a linkage is given by


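$$P(S, \Lambda) = \prod_{i=1}^{|S|} D(d_i \mid w_{i-1}\,w_i)\; T(w_i \mid w_{i-2}\,w_{i-1})^{\delta(\triangleleft i,\, i-1)}\; L(w_i \mid w_{\triangleleft i - 1}\,w_{\triangleleft i})^{1 - \delta(\triangleleft i,\, i-1)} \qquad (1)$$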

Here $d_i \in \{\text{HALT}, \text{STEP}, \text{BRANCH}\}$, $\delta(i,j)$ is equal to one if $i = j$ and is equal to zero
otherwise, and the indices $\triangleleft i$ are understood to be taken with respect to the linkage $\Lambda$. The
indices $\triangleleft i$ determine which words are linked together, and so completely determine a
valid linkage as long as they satisfy the no-crossing condition $\triangleleft j \le \triangleleft i$ whenever $\triangleleft j < i < j$. In
particular, specifying the indices $\triangleleft i$ determines the values of $d_i$, since $d_i$ is equal to HALT,
STEP, or BRANCH when $\sum_j \delta(i, \triangleleft j)$ is equal to zero, one, or two, respectively.
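
To make the link-index notation concrete, the following sketch derives the disjunct of each word from the left-link indices and tests the no-crossing property. The function names and the list layout are hypothetical. The example covers only the first ten words of the sentence above; the values for words 8, 9 and 10 are those quoted in the text, while the adjacent links assumed for words 2 through 7 are an illustration rather than something the text states, and the disjuncts of the last words reflect the truncated prefix only.

```python
NAMES = ["HALT", "STEP", "BRANCH"]


def disjuncts(left):
    """left[i] is the position that word i links to on its left (1-based; left[0] is unused).

    Returns a list d with d[i] in {HALT, STEP, BRANCH}, determined by how many
    later words name i as their left link.
    """
    n = len(left) - 1
    right_links = [0] * (n + 1)
    for i in range(1, n + 1):
        if not 0 <= left[i] < i:
            raise ValueError("each word must link to an earlier position")
        right_links[left[i]] += 1          # position 0 plays the role of the boundary
    if max(right_links[1:]) > 2:
        raise ValueError("a word may have at most two links to its right")
    return [None] + [NAMES[c] for c in right_links[1:]]


def has_crossing(left):
    """True if two links (left[i], i) and (left[j], j) with i < j cross."""
    n = len(left) - 1
    return any(left[i] < left[j] < i
               for j in range(1, n + 1) for i in range(1, j))


# Either a rioja ( La Rioja Alta '85 ) or   -- positions 1..10
left = [None, 0, 1, 2, 3, 4, 5, 6, 7, 4, 1]
d = disjuncts(left)
```

In this prefix the words “Either” and “(” each receive two links from the right, so both come out as BRANCH, which matches the branching of the state ending in “Either” described above.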

A full description of the model is best given in terms of a probabilistic pushdown
automaton. The automaton maintains a stack of states $s$, where $s$ is indexed by a word
bigram, and a finite memory containing a state $m$. It is governed by a finite control that
can read either HALT, STEP, or BRANCH. Initially the stack is empty, the finite memory
contains the bigram $(\perp, \perp)$, and the finite control reads STEP. The automaton proceeds
by carrying out three tasks. First, the finite control is read and a word is output with
the appropriate distribution. If the control reads either STEP or BRANCH, then word $w$ is
output with probability $T(w \mid m)$. If the control reads HALT then the automaton looks
at the stack. If the stack is empty the machine halts. Otherwise, the state $s$ is popped off
the stack and word $w$ is output with probability $L(w \mid s)$. Second, the memory state is
changed from $m = (w_1, w_2)$ to $m = (w_2, w)$. Third, the control is set to $d$ with probability
$D(d \mid m)$, and state $m$ is pushed onto the stack if the new setting is BRANCH. The
probability with which this machine halts after outputting sentence $S$ is precisely the
sum $\sum_{\Lambda \in \mathcal{L}(S)} P(S, \Lambda)$, where $P(S, \Lambda)$ is given by equation (1).
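
The three tasks of the pushdown automaton translate almost line by line into code. The sketch below is a minimal illustration, assuming dictionary-based tables for T, L and D and a `<s>` stand-in for the boundary word; these names, the data layout and the toy probabilities are hypothetical.

```python
import random

B = "<s>"  # hypothetical stand-in for the boundary word


def draw(dist, rng):
    """Sample one key of a {value: probability} dictionary."""
    items, probs = zip(*dist.items())
    return rng.choices(items, weights=probs)[0]


def generate(T, L, D, rng):
    """Sample one sentence from the pushdown automaton described in the text."""
    stack, memory, control = [], (B, B), "STEP"
    out = []
    while True:
        # Task 1: read the control and output a word.
        if control in ("STEP", "BRANCH"):
            w = draw(T[memory], rng)
        elif not stack:
            return out                       # HALT with an empty stack: stop
        else:
            w = draw(L[stack.pop()], rng)    # HALT: resume a pending long-range link
        out.append(w)
        # Task 2: update the memory from (w1, w2) to (w2, w).
        memory = (memory[1], w)
        # Task 3: choose the next control; remember the state if it branches.
        control = draw(D[memory], rng)
        if control == "BRANCH":
            stack.append(memory)


# Toy deterministic tables (assumptions) that produce a long-range link Either ... or.
T = {(B, B): {"Either": 1.0}, (B, "Either"): {"a": 1.0}}
L = {(B, "Either"): {"or": 1.0}}
D = {
    (B, "Either"): {"BRANCH": 1.0},
    ("Either", "a"): {"HALT": 1.0},
    ("a", "or"): {"HALT": 1.0},
}
words = generate(T, L, D, random.Random(0))
```

In the toy run the state ending in “Either” branches, “a” is generated by the short-range distribution, and “or” is generated later from the popped state, mirroring the example given for the linkage above.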

In terms of link grammar, there is a natural equivalence between the values HALT,
STEP, and BRANCH and three simple disjuncts, specifying how a word connects to other
words. The value HALT corresponds to a disjunct having a single (unlabeled) left
connector, and no right connectors, indicating that a connection can be made to any word on
the  left,  but  to  no  word  on  the  right.  The  value  STEP  corresponds  to  a  disjunct  having  a 
single  left  connector  and  a  single  right  connector,  and  the  value  BRANCH  corresponds  to  a 
disjunct  having  a  single  left  connector  and  two  right  connectors.  With  this  grammar,  the 
probabilistic  model  described  above  is  a  simple  variant  of  the  general  probabilistic  model 
presented  in  [LST92]. 

In terms of phrase structure grammar, the constructive equivalence between link
grammar and context free grammars given in [ST91] can be extended probabilistically. This
shows how the above model is equivalent to the following standard probabilistic
context-free model:

Kw  -  Kru  D  (  BRANCH  I  u,  u; ) 

D(STEP|u,«;) 

D(HALT|i;,it;) 

L(t/luu;) 

T(;|uu;). 

Here  x,y,z  are  vocabulary  words  with  i,y  and  A,  B,  and  C  are  families  of  nonter¬ 
minals  parameterized  by  triples  of  words  u,u,U7.  The  corresponding  rule  probabilities  are 
given  in  the  second  column.  The  start  nonterminal  of  the  grammar  is  S  =  A];  This 
view  of  the  model  is  unwieldy  and  unnatural,  and  does  not  benefit  from  the  efficient  link 
grammar  parsing  and  pruning  algorithms. 

3  Inferring  the  Grammar 

The  probabilistic  model  described  in  the  previous  section  makes  its  predictions  using  both 
long-range  and  short-range  trigrams.  In  principle,  we  can  allow  a  word  to  be  linked  to 
any  word  to  its  left.  This  corresponds  to  a  “grammar”  that  allows  a  long-range  link 
between  any  pair  of  words.  The  number  of  possible  linkages  for  this  grammar  grows 
rapidly with sentence length: while a 10-word sentence has only 835 possible linkages, a
25-word sentence has 3,192,727,797 linkages (see Appendix A). Yet most of the long-range
links in these linkages are likely to be spurious. The resulting probabilistic model has far
too many parameters to be reliably estimated.
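
The two linkage counts quoted above coincide with the Motzkin numbers $M_9$ and $M_{24}$, that is, with $M_{n-1}$ for an $n$-word sentence. This identification is an observation made here and is not taken from Appendix A, which is not part of this section. Under that assumption, the counts can be reproduced with the standard Motzkin recurrence $M_{n+1} = M_n + \sum_{i=0}^{n-1} M_i M_{n-1-i}$; the function name below is hypothetical.

```python
def motzkin(count):
    """Return the list [M_0, M_1, ..., M_{count-1}] of Motzkin numbers (count >= 2)."""
    m = [1, 1]
    while len(m) < count:
        n = len(m) - 1
        # M_{n+1} = M_n + sum_{i=0}^{n-1} M_i * M_{n-1-i}
        m.append(m[n] + sum(m[i] * m[n - 1 - i] for i in range(n)))
    return m


m = motzkin(25)
ten_word_linkages = m[9]            # 835
twenty_five_word_linkages = m[24]   # 3192727797
```

The growth is roughly geometric with ratio approaching 3, which is why an unrestricted grammar quickly becomes unmanageable.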

Since  an  unrestricted  grammar  is  impractical,  we  would  like  to  restrict  the  grammar  to 
allow  those  long-range  links  that  bring  the  most  improvement  to  the  probabilistic  model. 
Ideally,  we  would  like  to  automatically  discover  pairs  of  words  such  as  and  with 
long-range  correlations  that  are  good  candidates  to  be  connected  by  a  long-range  link. 
We  might  find  such  pairs  by  looking  for  words  with  high  mutual  information.  But  if 
we  imagine  that  we  have  already  included  all  nearest  neighbor  links  in  our  model,  as  is 
the  case  for  the  model  (1),  there  is  no  point  in  linking  up  a  pair  of  words  L  and  R,  no 
matter  how  high  their  mutual  information,  if  R  is  already  well-predicted  by  its  immediate 
predecessor.  Instead,  we  would  like  to  find  links  between  words  that  have  the  potential 
of  improving  a  model  with  only  short-range  links. 

To  find  such  pairs  we  adopt  the  following  approach.  Let  V  be  the  language  vocabulary. 
For each pair $(L, R) \in V \times V$, we construct a model $P_{LR}$ that contains all the bigram
links together with only one additional long-range link: that from $L$ to $R$. We choose the
models $P_{LR}$ to be simple enough so that the parameters of all the $|V|^2$ possible models
can be estimated in parallel. We then rank the models according to the likelihood each
assigns to the training corpus, and choose those pairs $(L, R)$ corresponding to the highest
ranked models. This list of word pairs constitutes the “grammar” as described in the
previous  section. 

The model $P_{LR}$ that we construct for a particular pair $(L, R)$ is a simplification of the
model of the previous section; it can be described as a probabilistic finite state automaton
(and thus it requires no stack). Before explaining the details of the model, consider the
standard bigram model $B(w' \mid w)$ viewed as a probabilistic finite state machine. This
machine maintains a finite memory $m$ that contains the previously generated word, and
can be in one of two states, as shown in Figure 1. The machine begins in state 1 with
$m = \perp$. The machine operates by making a state transition, outputting a word $w$, and
then setting the memory $m$ to $w$. More precisely, the machine remains in state 1 with
probability $D(\text{STEP} \mid m) = 1 - B(\perp \mid m)$. Given that a transition to state 1 is made, the
word $w$ is output with probability $B'(w \mid m)$, where


$$B'(w \mid m) = \frac{B(w \mid m)}{1 - B(\perp \mid m)}.$$

Alternatively, the machine outputs $\perp$ and proceeds to state 2, where it halts, with
probability $D(\text{HALT} \mid m) = B(\perp \mid m)$.

The machine underlying our probabilistic model $P_{LR}$ is depicted in a similar fashion
in Figure 2. Like the bigram machine, our new automaton maintains a memory $m$ of
the most recently output word and begins in state 1 with $m = \perp$. Unlike the bigram
machine, it enters a special state whenever word $L$ is generated: from state 1, the machine
outputs $L$, sets $m = L$, and makes a transition to state 3 with probability $B(L \mid m)$. A
transition from state 3 back to state 1 is made with probability $D_{LR}(\text{STEP} \mid L)$. In this
case, no word is output, and the memory $m$ remains set to $L$. Alternatively, from state
3 the machine can output a word $w \ne L, R$ and proceed to state 4 with probability
$D_{LR}(\text{BRANCH} \mid L)\, B_{LR}(w \mid m)$, where

$$B_{LR}(w \mid m) = \frac{B(w \mid m)}{1 - B(L \mid m) - B(R \mid m) - B(\perp \mid m)}.$$


Figure 1: A bigram machine


Figure 2: An (L, R) machine


Once in state 4, the machine behaves much like the original bigram machine, except that
neither an $L$ nor an $R$ can be generated. Word $w$ is output and the machine remains in
state 4 with probability $D_{LR}(\text{STEP} \mid m)\, B_{LR}(w \mid m)$; it makes a transition back to state 1
and outputs word $R$ with probability $D_{LR}(\text{HALT} \mid m)$.

According  to  this  probabilistic  finite  state  machine,  words  are  generated  by  a  bigram 
model except for the word $R$, which is generated either from its immediate predecessor
or  from  the  closest  L  to  its  left.  Maximum-likelihood  training  of  this  machine  yields  an 
estimate  of  the  reduction  in  entropy  over  the  bigram  model  afforded  by  allowing  long-range 
links  between  L  and  R  in  the  general  model  presented  in  Section  2. 

Training of the models $P_{LR}$ for many $(L, R)$ pairs in parallel is facilitated by two
approximating assumptions on the parameters. First, we assume that $B_{LR}(w \mid m) = B(w \mid m)$.
Second, we assume that $D_{LR}(d \mid m) = D_{LR}(d)$ for $m \ne L$. Under this assumption, the
parameter $D_{LR}(\text{HALT})$ encodes the distribution of the number of words between $L$ and $R$;
in the hidden model the number of words between them is geometrically distributed with
mean $D_{LR}(\text{HALT})^{-1}$.

Each  model  Plr  can  be  viewed  as  a  link  grammar  enhancement  of  the  bigram  model 
in  the  following  way.  In  the  bigram  model  each  non-boundary  word  w  has  the  single  STEP 
disjunct, allowing only links between adjacent words. In the model $P_{LR}$ we add
disjuncts to allow long-range links between $L$ and $R$. Specifically, we give all words the
two disjuncts STEP and HALT. In addition, we give $L$ a third disjunct $\text{BRANCH}_{LR}$, which
like STEP allows connections to any word on the left and right, and in addition requires a
long-range connection to $R$. Similarly, we give $R$ a third disjunct $\text{STEP}_{LR}$ which connects
to $L$ on the left and any word on the right. This allows linkages such as

...  a  b  L  c  d  e  f  R  g  h  ... 

Now suppose that $S = w_1 \cdots w_n$ is a sentence containing a single $(L, R)$ pair separated
by at least one word. The probability of $S$ is a sum over two linkages, $\Lambda_{\text{bigram}}$ and $\Lambda_{LR}$:
$P_{LR}(S) = P_{LR}(\Lambda_{\text{bigram}}) + P_{LR}(\Lambda_{LR})$. If we let $k$ be the number of words between $L$
and $R$, then under the two approximating assumptions made above it is easy to see that

$$P_{LR}(\Lambda_{LR}) = c\, D_{LR}(\text{BRANCH}_{LR} \mid L)\, D_{LR}(\text{STEP})^{k-1}\, D_{LR}(\text{HALT})$$

$$P_{LR}(\Lambda_{\text{bigram}}) = c\, D_{LR}(\text{STEP} \mid L)\, B(R \mid w_{j-1})$$

where $c$ depends only on bigram probabilities and $w_{j-1}$ is the word immediately preceding
$R = w_j$. If, on the other hand, $S$ contains a single $L$
and no $R$ then there is a unique linkage for the sentence whose probability is $D_{LR}(\text{STEP} \mid L)$
times the bigram probability of the sentence. More generally, given the model just
described, it is easy to write down the probability $P_{LR}(S)$ of any sentence with respect to
the parameters $D_{LR}(d \mid w)$ and $B(w_2 \mid w_1)$.

We train the parameters of this family of models in parallel, using the
forward-backward algorithm. In order to do this, we first make a single pass through the training
corpus, accumulating the following counts. For each $(L, R)$ pair, we count $N(k, w \mid L, R)$,
the number of times that $L$ and $R$ are separated by exactly $k \ge 1$ words, none of which
is $L$ or $R$, and the word immediately before $R$ is $w$. We also count $N(\neg R \mid L)$, the number
of times $L$ appears in a sentence and either is not followed by an $R$ in the same sentence
or is followed by an $L$ before an $R$. In terms of these counts, the increase in log-likelihood
for model $P_{LR}$ over the bigram model, in bits of information, is given by

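$$\text{Gain}_{LR} = \sum_{k,w} N(k, w \mid L, R)\, \log \frac{P_{LR}(k, w, R \mid L)}{B(R \mid w)} + N(\neg R \mid L)\, \log D_{LR}(\text{STEP} \mid L)$$

where

$$P_{LR}(k, w, R \mid L) = D_{LR}(\text{BRANCH}_{LR} \mid L)\, D_{LR}(\text{STEP})^{k-1}\, D_{LR}(\text{HALT}) + \bigl(1 - D_{LR}(\text{BRANCH}_{LR} \mid L)\bigr)\, B(R \mid w)$$

and

$$N(\neg R \mid L) = c(L) - \sum_{k,w} N(k, w \mid L, R)$$

with $c(L)$ the unigram count of $L$. Using this formula, forward-backward training can be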
quickly  carried  out  in  parallel  for  all  models,  without  further  passes  through  the  corpus. 
The  results  of  this  calculation  are  shown  in  Section  6. 
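
As a rough illustration of how the gain could be evaluated for one candidate pair from the accumulated counts, consider the sketch below. It assumes that $D_{LR}(\text{STEP} \mid L) = 1 - D_{LR}(\text{BRANCH}_{LR} \mid L)$ and $D_{LR}(\text{STEP}) = 1 - D_{LR}(\text{HALT})$, that the log-ratio is taken against the bigram probability of $R$ given the preceding word, and that logarithms are base 2 since the gain is measured in bits. The function names, argument layout and numbers are hypothetical, and the parameter values are taken as given rather than trained.

```python
import math


def p_lr(k, b_r_given_w, d_branch, d_halt):
    """Probability of R after k intervening words: long-range link plus bigram link."""
    d_step = 1.0 - d_halt  # assumption: STEP and HALT are the only options in state 4
    return d_branch * d_step ** (k - 1) * d_halt + (1.0 - d_branch) * b_r_given_w


def gain_bits(counts, bigram, c_L, d_branch, d_halt):
    """Log-likelihood gain over the bigram model for one (L, R) pair, in bits.

    counts maps (k, w) -> N(k, w | L, R); bigram maps w -> B(R | w);
    c_L is the unigram count of L.
    """
    linked = sum(counts.values())
    gain = sum(n * math.log2(p_lr(k, bigram[w], d_branch, d_halt) / bigram[w])
               for (k, w), n in counts.items())
    not_followed = c_L - linked                        # N(not R | L)
    gain += not_followed * math.log2(1.0 - d_branch)   # log D_LR(STEP | L)
    return gain


no_link = gain_bits({(2, "a"): 3}, {"a": 0.1}, 5, 0.0, 0.5)
useful_link = gain_bits({(1, "a"): 10}, {"a": 0.01}, 10, 0.5, 0.5)
```

Setting the branch probability to zero recovers the bigram model, so the gain vanishes; a pair in which $R$ is poorly predicted by its predecessor but reliably follows $L$ yields a positive gain.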

4  The  Mechanics  of  EM  Estimation 

In this section we describe the mechanics of estimating the parameters of our model. Our
concern is not the mathematics of the inside-outside algorithm for maximum-likelihood
estimation of link grammar models [LST92], but managing the large quantities of data
that arise in training our model on a substantial corpus. We restrict our attention here to
the short-range trigram probabilities $T(z \mid x\,y)$ since these constitute the largest amount
of data, but our methods apply as well to the long-range trigrams $L(z \mid x\,y)$ and disjunct
probabilities $D(d \mid x\,y)$.

To begin, observe that the trigram probabilities must in fact be EM-trained. In a pure
trigram model the quantity $T(z \mid x\,y)$ is given by the ratio $c(x\,y\,z) / \sum_{z'} c(x\,y\,z')$, where
$c(x\,y\,z)$ is the number of times the trigram $(x\,y\,z)$ appears in the training corpus $C$.
But in our link grammar model the trigram probabilities represent the conditional word
probabilities in the case when the STEP linkage was used, and this is probabilistically
determined. The same ratio determines $T(z \mid x\,y)$, but the $c(x\,y\,z)$ are now expected
counts, where the expectation is with respect to trainable parameters.

The EM algorithm begins with some initial set $T(z \mid x\,y)$ of trigram probabilities.
For each sentence $S$ of the corpus, the algorithm labels the trigrams $(w_{i-2}\,w_{i-1}\,w_i)$ of
$S$ with these probabilities. From these and other parameters, the E-step determines
partial estimated counts $\partial(w_{i-2}\,w_{i-1}\,w_i)$. The partial counts for a particular trigram
$(x\,y\,z)$, accumulated over all instances of the trigram in the corpus, give the trigram’s
full estimated count $c(x\,y\,z)$.

Our  difficulties  occur  in  implementing  the  EM  algorithm  on  the  desired  scale.  To 
explain  these  difficulties,  we  will  briefly  sketch  some  naive  approaches.  We  assume  the 
computer  we  will  use  has  a  substantial  but  not  unlimited  primary  memory,  which  may  be 
read  and  written  at  random,  and  a  much  larger  secondary  memory,  which  must  be  read 
and written sequentially. We will treat the corpus $C$ as a series of words, $w_1, w_2, \ldots, w_{|C|}$,
recorded as indices into a fixed vocabulary $V$; this series is marked off into sentences.
A word index is represented in 2 bytes, and a real value in 4 bytes.

Suppose we try to assign trigram probabilities $T(w_i \mid w_{i-2}\,w_{i-1})$ to the sentence by
looking them up in a table, and likewise accumulating the partial counts $\partial(w_{i-2}\,w_{i-1}\,w_i)$
into a table. Both must be randomly addressable and hence held in primary memory:
each table must have space for $|V|^3$ entries. For realistic vocabularies of size $|V| \approx 5 \times 10^4$,
the two tables together would occupy $10^{15}$ bytes, which far exceeds the capacity of current
memory technologies, primary or secondary.

In  a  corpus  of  \C\  words,  no  more  than  \C\  distinct  trigrams  may  appear.  This  suggests 
that  we  maintain  the  table  by  entering  values  only  for  the  trigrams  (x  y  z)  that  actually 
appear in the corpus. A table entry will consist of $(x\,y\,z)$, and its $T(z \mid x\,y)$ and $\partial(x\,y\,z)$.
For fast access, the table is sorted by trigram. Unfortunately, this approach is also
impractical. For a moderate training corpus of 25 million words, this table will occupy on
the order of $25 \times 10^6 \times (6 + 4 + 4) = 350 \times 10^6$ bytes, which exceeds the primary memory
of a typical computer.
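
The two storage estimates follow directly from the sizes stated above (2 bytes per word index, 4 bytes per real value). The short calculation below simply reproduces the arithmetic; the variable names are hypothetical.

```python
WORD_BYTES = 2   # bytes per word index, as stated in the text
REAL_BYTES = 4   # bytes per real value, as stated in the text

vocab_size = 5 * 10**4
# Two dense tables (probabilities and partial counts) with |V|^3 real entries each.
dense_bytes = 2 * vocab_size**3 * REAL_BYTES

corpus_words = 25 * 10**6
# One sparse entry per corpus position: a trigram (3 word indices) plus two reals.
sparse_bytes = corpus_words * (3 * WORD_BYTES + 2 * REAL_BYTES)
```

The dense tables come to $10^{15}$ bytes and the sparse table to $350 \times 10^6$ bytes, in agreement with the figures in the text.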

Thus  we  are  forced  to  abandon  the  idea  of  maintaining  the  needed  data  in  primary 
memory.  Our  solution  is  to  store  the  probabilities  and  counts  in  secondary  memory;  the 
difficulty  is  that  secondary  memory  must  be  read  and  written  sequentially. 

We begin by dividing the corpus into $R$ segments $C^1, \ldots, C^R$, each containing about
$|C|/R$ words. The number of segments $R$ is chosen to be large enough so that a table of
$|C|/R$ real values can comfortably reside in primary memory. For each segment $C^i$, which
is a sequence of words $w_1, \ldots, w_{|C^i|}$, we write an entry file $E_i$ with structure

$$(w_1\,w_2\,w_3)\ 3,\quad (w_2\,w_3\,w_4)\ 4,\quad \ldots,\quad (w_{|C^i|-2}\,w_{|C^i|-1}\,w_{|C^i|})\ |C^i|.$$

We sort $E_i$ by trigram to yield $SE_i$, which has structure

$$(x\,y\,z_1)\ j_1^{xyz_1},\ \ldots,\ (x\,y\,z_1)\ j_{N(xyz_1)}^{xyz_1},\quad (x\,y\,z_2)\ j_1^{xyz_2},\ \ldots,\ (x\,y\,z_2)\ j_{N(xyz_2)}^{xyz_2},\quad \ldots$$

Here we have written $N(x\,y\,z)$ for the number of times a trigram appears in the segment,
and $j_1^{xyz}, \ldots, j_{N(xyz)}^{xyz}$ for the sequence of positions where it appears. This sort is done
one time only, before the start of EM training.
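
A small in-memory sketch of the entry file and its sorted version may help fix the layout. It assumes a segment is a plain Python list of words and that positions are 1-based and refer to the last word of each trigram, as in the structure shown above; the function names are hypothetical, and sentence boundaries are ignored for brevity.

```python
def entry_file(segment):
    """E_i: every trigram of the segment paired with the position of its last word."""
    return [(tuple(segment[j - 3:j]), j) for j in range(3, len(segment) + 1)]


def sorted_entry_file(segment):
    """SE_i: the same entries sorted by trigram, then by position."""
    return sorted(entry_file(segment))


segment = ["a", "b", "c", "a", "b", "c", "a"]
E = entry_file(segment)
SE = sorted_entry_file(segment)
```

In `SE` all positions of a given trigram are contiguous, which is what later allows partial counts to be summed, and probabilities to be written back, with purely sequential reads.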

A single EM iteration proceeds as follows. First we perform an E-step on each segment
$C^i$. We assume the existence of a file $AT_i$ that contains sequentially arranged trigram
probabilities for $C^i$. (For the very first EM iteration, it is easy to construct this file by writing
out appropriate uniform probabilities.) Each segment’s E-step sequentially writes a file
$PCT_i$ of partial estimated counts, $\partial(w_1\,w_2\,w_3)$, $\partial(w_2\,w_3\,w_4)$, $\ldots$, $\partial(w_{|C^i|-2}\,w_{|C^i|-1}\,w_{|C^i|})$.

Next we sum these partial counts to obtain the segment counts. To do this we read
$PCT_i$ into primary memory. Then we read $SE_i$ sequentially and accumulate the segment
count $c_i(x\,y\,z_1) = \sum_{k=1}^{N(xyz_1)} PCT_i[j_k^{xyz_1}]$, and so on for each successive trigram of $C^i$.
As each sum completes, we write it sequentially to a file $SCT_i$ of segment counts, of
format $c_i(x\,y\,z_1)$, $c_i(x\,y\,z_2)$, $\ldots$ The trigram that identifies each count can be obtained
by sequential inspection of $SE_i$.

Now we merge across segments, by scanning all $2R$ files $SE_i$ and $SCT_i$, and forming
the complete counts $c(x\,y\,z) = \sum_i c_i(x\,y\,z)$. As we compute these sums, we maintain
a list $c(x\,y\,z_1)$, $c(x\,y\,z_2)$, $\ldots$ in primary memory (there will be no more than $|V|$ of
them) until we encounter a trigram $(u\,v\,\cdot)$ in the input stream where $x \ne u$ or $y \ne v$.
Then we compute $c(x\,y) = \sum_{z \in V} c(x\,y\,z)$ and dump the trigrams $(x\,y\,z)$ and quotients
$T'(z \mid x\,y) = c(x\,y\,z)/c(x\,y)$ sequentially to a file $ST'$. Note that $ST'$ is a sorted list of
all the reestimated trigram probabilities.

To complete the process we must write a sequentially ordered file $AT'_i$ of the
reestimated trigram probabilities for $C^i$. First we create a table in primary memory of
size $|C^i|$. Then we read $SE_i$ and $ST'$ sequentially as follows. For each new trigram
$(x\,y\,z)$ we encounter in $SE_i$, we search forward in $ST'$ to find $T'(z \mid x\,y)$. Then for each
$j_1^{xyz}, \ldots, j_{N(xyz)}^{xyz}$ listed for $(x\,y\,z)$ in $SE_i$, we deposit $T'(z \mid x\,y)$ in $AT'_i[j_k^{xyz}]$. When
$SE_i$ is exhausted we have filled each position in $AT'_i$. We write it sequentially to disk and
are then ready for the next EM iteration.
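
The M-step bookkeeping described above can be sketched with in-memory lists standing in for the sequential files. This is only an illustration under several assumptions: all function names are hypothetical, the partial counts are given rather than produced by a real E-step, and the merged totals are held in a list for simplicity, whereas the procedure in the text keeps no more than $|V|$ counts in memory at a time.

```python
import heapq
from itertools import groupby


def sorted_entries(segment):
    """SE_i: (trigram, position of its last word), sorted by trigram."""
    return sorted((tuple(segment[j - 3:j]), j) for j in range(3, len(segment) + 1))


def segment_counts(se, pct):
    """SCT_i: sum the partial counts of each trigram in one pass over SE_i."""
    return [(tri, sum(pct[pos] for _, pos in group))
            for tri, group in groupby(se, key=lambda e: e[0])]


def reestimate(sct_files):
    """ST': merge the sorted segment counts and normalize within each (x, y) history."""
    merged = heapq.merge(*sct_files)
    totals = [(tri, sum(c for _, c in group))
              for tri, group in groupby(merged, key=lambda e: e[0])]
    st = []
    for _, group in groupby(totals, key=lambda e: e[0][:2]):
        group = list(group)
        c_xy = sum(c for _, c in group)
        st.extend((tri, c / c_xy) for tri, c in group)
    return st


def scatter(se, st, size):
    """AT'_i: walk SE_i and ST' forward together and deposit each probability."""
    at = [None] * (size + 1)     # 1-based positions; slots 0..2 hold no trigram
    probs = iter(st)
    tri_st, p = next(probs)
    for tri, pos in se:
        while tri_st != tri:     # search forward in ST'; no sorting is needed
            tri_st, p = next(probs)
        at[pos] = p
    return at


seg1, seg2 = ["a", "b", "c", "a", "b", "c", "a"], ["a", "b", "d"]
se1, se2 = sorted_entries(seg1), sorted_entries(seg2)
pct1 = {3: 0.5, 4: 1.0, 5: 1.0, 6: 0.5, 7: 1.0}   # made-up partial counts
pct2 = {3: 1.0}
sct1, sct2 = segment_counts(se1, pct1), segment_counts(se2, pct2)
st = reestimate([sct1, sct2])
at1 = scatter(se1, st, len(seg1))
```

Because `se1` and `st` are both ordered by trigram, the scatter step only ever moves forward in each, which is the property that lets the real procedure run against sequential secondary storage.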


5  Smoothing 

The  link  grammar  model  given  by  (1)  expresses  the  probability  of  a  sentence  in  terms 
of  three  sets  of  more  fundamental  probability  distributions  T,  L  and  D,  so  that  P(S)  = 
$P(S; T, L, D)$. In the previous sections, we tacitly assumed 2-word history, non-parametric
forms. That is, we allowed a separate free parameter for each 2-word history and prediction
value subject only to the constraints that probabilities sum to one. In the case of the
trigram distribution $T$, for example, there are separate parameters $T(w \mid w'\,w'')$ for each
triple of words $(w, w', w'')$ subject to the constraints $\sum_w T(w \mid w'\,w'') = 1$ for all $(w', w'')$.

We will refer to such 2-word history distributions as 3-gram estimators, since they are
indexed  by  triples,  and  will  denote  them  by  T3,  L3  and  D3.  In  the  previous  section  we 
outlined  an  efficient  implementation  of  EM  (inside-outside)  training,  for  adjusting  the 
parameters  of  T3,  L3  and  D3  to  maximize  the  log-likelihood  of  a  large  corpus  of  training 
text. 

Unfortunately,  we  cannot  expect  the  link  grammar  model  using  maximum  likelihood 
distributions  to  work  well  when  applied  to  new  data.  Rather,  the  distributions  are  likely  to 
be  too  sharply  determined  by  the  training  corpus  to  generalize  well.  This  is  the  standard 
problem  of  overtraining  and  may  be  addressed  by  mixing  the  sharply  defined  distributions 
with  less  sharp  ones  to  obtain  smoother  distributions.  This  procedure  is  referred  to  as 
smoothing. 

The smoothing we employ in the link grammar is motivated by the smoothing typically
used for the trigram language model [BBdSM91]. The idea is to linearly combine the
3-gram estimators $T_3$, $L_3$ and $D_3$ with corresponding 2-gram, 1-gram and uniform estimators
to obtain smooth distributions $T_\lambda$, $L_\lambda$ and $D_\lambda$. In the case of $T_\lambda$ we have

$$T_\lambda(w \mid w'\,w'') = \lambda_3 T_3(w \mid w'\,w'') + \lambda_2 T_2(w \mid w') + \lambda_1 T_1(w) + \lambda_0 T_0. \qquad (2)$$

Here  T3,  T2  and  Ti  denote  3-gram,  2-gram  and  1-gram  estimators  for  T,  and  To  denotes  a 
uniform  distribution.  The  2-gram  estimator  T2  has  a  separate  parameter  T2(iy|rw0  for  each 

2- gram  (ww')  subject  to  the  constraint  that  X^„,T2(u;|iy')  =  1.  The  1-gram  estimator  Ti 
has  a  separate  parameter  Ti{u;)  for  each  w.  In  general,  an  n-gram  estimator  depends  on 
n  —  1  words  of  context.  The  parameters  A^  satisfy  the  constraint  A,  =  1  to  ensure  that 
Ta  is  a  probability  distribution.  Equation  (2)  employs  the  same  vector  (Aq,  Aj,  A2,  A3),  for 
each  triple  {w,w\w").  In  practice,  different  vectors  of  A’s  are  used  for  different  triples. 
We  define  La  and  Da  similarly.  We  then  define  the  smooth  link  grammar  model  P\  using 
these  smooth  distributions:  Px(S)  =  P(5;Ta, La,  Da). 
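
As an illustration only, the interpolation in equation (2) can be sketched in a few lines of Python. The lookup tables `t3`, `t2` and `t1`, the function name `interpolate` and the toy numbers are assumptions made for this sketch and are not part of our implementation; a single vector of mixing weights is shared by every context, as in equation (2).

```python
def interpolate(w, wp, wpp, t3, t2, t1, vocab_size, lambdas):
    """Smoothed probability of w given the context (wp, wpp), as in equation (2).

    t3, t2 and t1 are hypothetical lookup tables holding the 3-gram, 2-gram and
    1-gram estimators; lambdas = (l0, l1, l2, l3) are the mixing weights.
    """
    l0, l1, l2, l3 = lambdas
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("mixing weights must sum to one")
    return (l3 * t3.get((w, wp, wpp), 0.0)
            + l2 * t2.get((w, wp), 0.0)
            + l1 * t1.get(w, 0.0)
            + l0 / vocab_size)  # uniform estimator T0


# Toy estimators over a three-word vocabulary (made-up numbers).
vocab = ["a", "b", "c"]
t1 = {"a": 0.5, "b": 0.3, "c": 0.2}
t2 = {("a", "b"): 0.6, ("b", "b"): 0.4}   # T2(w | w' = "b")
t3 = {("a", "b", "c"): 1.0}               # T3(w | w' = "b", w'' = "c")
lambdas = (0.1, 0.2, 0.3, 0.4)

smoothed = {w: interpolate(w, "b", "c", t3, t2, t1, len(vocab), lambdas)
            for w in vocab}
print(smoothed)
```

Because each component estimator in this toy context is normalized and the weights sum to one, the smoothed values sum to one over the vocabulary, and no word receives zero probability.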

To completely specify the smooth distributions, we must fix the values of the parameters
of the individual n-gram distributions as well as the mixing parameters $\lambda$.
Estimating all of these simultaneously using maximum likelihood training would defeat the
purpose of smoothing: we would find that the only non-zero $\lambda$'s would be those
multiplying the 3-gram estimators, which would ultimately train to their maximum
likelihood (and thus unsmooth) values! Instead we adopt the following procedure,
motivated by the deleted interpolation method sometimes used for the trigram model
[BBdSM91]. We first divide our corpus of sentences into two parts: a large training corpus
$\mathcal{T}$, and a smaller smoothing corpus $\mathcal{S}$. We estimate the n-gram
estimators using the training corpus only, according to the following scheme. The 3-gram
estimators $T_3$, $L_3$ and $D_3$ are chosen to maximize the log-likelihood
$\sum_{S \in \mathcal{T}} \log P(S \mid T_3, L_3, D_3)$ of the training corpus using the EM
technique described in the previous section.

The 3-gram estimators are then used to “reveal” the hidden linkages of the training
corpus, and the 2-gram and 1-gram estimators are chosen to maximize the likelihood of the
training corpus together with these revealed linkages. Thus, for i = 1, 2, the
distributions $T_i$, $L_i$ and $D_i$ maximize
$\sum_{S \in \mathcal{T}} \sum_{\Lambda} P(\Lambda \mid S, T_3, L_3, D_3) \log P(S, \Lambda \mid T_i, L_i, D_i)$,
where the inner sum runs over the linkages $\Lambda$ of S. This procedure,
while somewhat unwieldy to explain, is simple to implement, as it amounts to obtaining
the 2-gram and 1-gram estimators as appropriate conditionals of the EM counts for the
3-gram estimators.
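
The remark about conditionals of the EM counts can be illustrated with a minimal sketch. The count table `counts`, keyed by triples (w, w', w''), and the function name are hypothetical; the counts may be fractional because they are expectations over hidden linkages.

```python
from collections import defaultdict


def lower_order_estimators(counts):
    """Derive 2-gram and 1-gram estimators from 3-gram expected counts.

    counts maps (w, wp, wpp) -> expected count of predicting w in context
    (wp, wpp). The 2-gram table sums out wpp; the 1-gram table sums out both.
    """
    c2 = defaultdict(float)   # counts for (w, wp)
    c1 = defaultdict(float)   # counts for w
    for (w, wp, _wpp), c in counts.items():
        c2[(w, wp)] += c
        c1[w] += c

    context_totals = defaultdict(float)
    for (_w, wp), c in c2.items():
        context_totals[wp] += c
    grand_total = sum(c1.values())

    t2 = {(w, wp): c / context_totals[wp] for (w, wp), c in c2.items()}
    t1 = {w: c / grand_total for w, c in c1.items()}
    return t2, t1


# Made-up fractional EM counts for three triples.
counts = {("a", "b", "c"): 1.5, ("a", "b", "d"): 0.5, ("e", "b", "c"): 2.0}
t2, t1 = lower_order_estimators(counts)
print(t2, t1)
```

No additional pass over the training corpus is needed in this sketch: the lower-order tables are obtained by summing and renormalizing the counts already accumulated for the 3-gram estimators.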

With the n-gram estimators thus determined, we adjust the mixing parameters $\lambda$ to
maximize the probability of the smoothing corpus only. The logarithm of this probability
is

$$O_{\mathrm{outer}}(\lambda) = \sum_{S \in \mathcal{S}} \log P(S \mid T_\lambda, L_\lambda, D_\lambda) = \sum_{S \in \mathcal{S}} \log \sum_{\Lambda} P(S, \Lambda \mid T_\lambda, L_\lambda, D_\lambda) ,$$

where $\Lambda$ ranges over the linkages of S. This maximization is complicated by the fact
that the probability of a sentence now involves not only a sum over hidden linkages, but
for each linkage, a sum over hidden $\lambda$ indices as well. We deal with this by
employing nested EM iterations, as follows.

1. Begin with some initial $\lambda$'s.

2. By the inside-outside algorithm described in [LST92] and the previous section, reveal
the hidden linkages of the smoothing corpus using the smooth distributions $T_\lambda$,
$L_\lambda$ and $D_\lambda$, and accumulate the EM counts $c_{T,\lambda}(t)$,
$c_{L,\lambda}(l)$ and $c_{D,\lambda}(d)$ for parameters t, l, d of the distributions T, L
and D. These are the counts obtained by maximizing the auxiliary function

$$\sum_{S \in \mathcal{S}} \sum_{\Lambda} P(\Lambda \mid S, T_\lambda, L_\lambda, D_\lambda) \log P(S, \Lambda \mid T_{\lambda'}, L_{\lambda'}, D_{\lambda'})$$

with respect to $\lambda'$, where the inner sum is over the linkages $\Lambda$ of S. Their
accumulation is the E-step of the outermost EM iterations.

3. Form the objective function

$$O_{\mathrm{inner}}(\lambda') = \sum_t c_{T,\lambda}(t) \log T_{\lambda'}(t) + \sum_l c_{L,\lambda}(l) \log L_{\lambda'}(l) + \sum_d c_{D,\lambda}(d) \log D_{\lambda'}(d) .$$

Notice that the $\lambda'$ indices are hidden in $O_{\mathrm{inner}}(\lambda')$. Use the
forward-backward algorithm to find the $\lambda'$'s that maximize
$O_{\mathrm{inner}}(\lambda')$ subject to the appropriate constraints.
These nested EM iterations are the M-step of the outermost EM iterations.

4. Using these $\lambda'$'s as new guesses for the $\lambda$'s, return to step 1, and
iterate until converged.
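
For intuition, the inner maximization of step 3 can be sketched in the simplest setting, which is an assumption of this example and not the configuration we use in practice: a single distribution with one shared vector of mixing weights. The hidden structure is then just the index of the mixture component responsible for each event, and the EM update reduces to the standard mixture-weight re-estimation shown below. Each entry of the hypothetical list `events` pairs an outer EM count with the fixed component probabilities (uniform, 1-gram, 2-gram, 3-gram) of one event.

```python
import math


def inner_objective(events, lambdas):
    """Sum over events of count * log(mixture probability)."""
    return sum(c * math.log(sum(l * q for l, q in zip(lambdas, probs)))
               for c, probs in events)


def inner_em(events, lambdas, iterations=15):
    """Re-estimate mixing weights with the component estimators held fixed."""
    for _ in range(iterations):
        expected = [0.0] * len(lambdas)
        for c, probs in events:
            mix = sum(l * q for l, q in zip(lambdas, probs))
            for i, (l, q) in enumerate(zip(lambdas, probs)):
                # Posterior probability that component i produced this event,
                # weighted by the outer EM count of the event.
                expected[i] += c * l * q / mix
        total = sum(expected)
        lambdas = [e / total for e in expected]
    return lambdas


# Made-up counts and component probabilities; the uniform component keeps
# every mixture probability strictly positive.
events = [
    (3.0, [0.01, 0.2, 0.5, 0.9]),
    (1.0, [0.01, 0.1, 0.0, 0.0]),
    (2.0, [0.01, 0.3, 0.4, 0.0]),
]
start = [0.25, 0.25, 0.25, 0.25]
fitted = inner_em(events, start)
print(fitted, inner_objective(events, start), inner_objective(events, fitted))
```

In this simplified setting the re-estimated weights remain normalized, and the inner objective does not decrease from one iteration to the next.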

Note  that  the  outermost  EM  steps  use  the  inside-outside  algorithm  for  link  grammars; 
the  hidden  parses  are  in  general  context-free  in  generative  power.  However,  step  3,  which 
is  the  M-step  for  the  inside-outside  algorithm,  is  itself  an  EM  estimation  problem.  Here, 
however,  the  hidden  structure  is  regular,  so  the  estimation  can  be  carried  out  using 
the forward-backward algorithm for probabilistic finite state machines. The general theory
of the EM algorithm guarantees that each iteration of the above algorithm increases the
log-likelihood $O_{\mathrm{outer}}$ of the smoothing corpus under the current smooth model. In
practice,  we  have  observed  that  roughly  three  iterations  of  the  outer  EM  iterations  and 
15  iterations  of  the  inner  EM  iterations  suffice  to  smooth  the  parameters  of  our  models. 


6  Sample  Results 

This section presents the results of inferring and training our long-range trigram model
on  a  corpus  of  Wall  Street  Journal  data. 

Figure  3  lists  examples  of  the  word  pairs  that  were  discovered  using  the  inference 
scheme  discussed  in  Section  3.  Recall  that  these  pairs  are  discovered  by  training  a  link 
grammar  that  allows  long-range  links  between  a  single,  fixed,  pair  of  words.  A  given  pair 
is  judged  by  the  reduction  in  entropy  that  its  one-link  model  achieves  over  the  bigram 
model.  In  the  table,  this  improvement,  measured  in  bits  of  information,  is  shown  in  the 
third  column.  The  first  section  of  the  table  lists  the  pairs  that  resulted  in  the  greatest 
reduction in entropy. The fourth column of the table gives the values of the probability
$D(\mathrm{BRANCH}_{LR} \mid L)$ after forward-backward training. This number indicates the
frequency with which L generates R from long range, according to the trained model. The
second section of the table lists examples of pairs with high
$D(\mathrm{BRANCH}_{LR} \mid L)$. The fifth column of the table gives the values of
$D_{LR}(\mathrm{HALT})^{-1}$ after forward-backward training. Recall that since the number
of words between L and R is geometrically distributed with mean
$D_{LR}(\mathrm{HALT})^{-1}$ in the hidden model, a large value in this column indicates
that L and R are on average widely separated in the training data. The third section of the table
gives  examples  of  such  pairs.  Finally,  the  fourth  section  of  the  table  shows  the  results  of 
the  word-pair  calculation  applied  to  the  corpus  after  it  was  tagged  with  parts-of-speech. 
L             R            Gain (bits, scaled)  D(BRANCH_LR | L)  D_LR(HALT)^-1

(             )                        472.944             0.808          2.277
``            ''                        80.501             0.089          3.041
between       and                       57.097             0.674          2.002
[             ]                         54.287             0.907          2.644
neither       nor                       22.883             0.588          2.030
either        or                        16.892             0.496          3.083
both          and                       14.915             0.277          1.786
-             -                         14.909             0.074          5.309
*                                       14.039             0.117          3.845
from          to                        13.021             0.044          1.931

tit           tat                        0.344             0.835          2.049
to_preheat    oven                       1.663             0.773          1.084
to_whet       appetite                   0.521             0.709          1.943
nook          cranny                     0.618             0.619          2.426
to_flex       muscle                     0.702             0.548          1.784
sigh          relief                     0.624             0.411          2.123
loaf          bread                      0.434             0.308          2.795

quarterback   touchdown                  0.167             0.027          5.715
inning        hit                                          0.018          5.673
farmer        crop                       0.347             0.023          5.609
investor      stock                                        0.014          5.149
firefighter   blaze                      0.513             0.071          4.955
whether       or                         5.123             0.124          4.925
she           her                        9.672             0.078          4.007

to_describe   as                         9.022             0.457
to_rise       to                         7.654             0.261          2.437
to_prevent    from                       7.491             0.407          3.743
to_turn       into                       6.642             0.174          3.566
to_attribute  to                         5.679             0.904          4.189
to_view       as                         5.193             0.524          3.425
to_bring      to                         4.960             0.237          3.836
to_range      to                         4.864             0.660          5.356

Figure 3: Sample word pairs


Figure  4:  2.5M  corpus 


Figure 5: 25M corpus
The search was restricted to verb-preposition pairs, and some of the pairs which yielded
the greatest reduction in entropy are shown here.

In  Figures  4  and  5  we  show  plots  of  perplexity  as  a  function  of  iteration  in  the  EM 
training  of  the  long-range  trigram  model  described  in  Section  2,  using  the  word-pair 
“grammar”  that  was  automatically  extracted.  These  graphs  plot  the  perplexity  as  a 
function  of  iteration,  with  the  trigram  perplexity  shown  as  a  horizontal  line.  In  the  first 
plot,  carried  out  over  a  training  set  of  2,521,112  words,  the  perplexity  falls  approximately 
12.7%  below  the  trigram  perplexity  after  9  iterations.  After  smoothing  as  described  in 
Section  5,  the  perplexity  on  test  data  was  approximately  4.3%  below  the  smoothed  trigram 
perplexity.  In  the  second  plot,  carried  out  over  a  training  set  of  25,585,580  words,  the 
perplexity falls approximately 8% below the trigram perplexity after 6 iterations. After
smoothing this model, the perplexity on test data was approximately 5.3% under the
smoothed trigram perplexity. The fact that the magnitude of the entropy reduction on
training data is not preserved after smoothing and evaluating on test data is an indication
that the smoothing may be sensitive to the “bucketing” of the $\lambda$'s.
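
Since perplexity is two raised to the per-word entropy in bits, a relative perplexity reduction converts directly into an entropy reduction. The helper below is a small sketch of that conversion; the function name is introduced here for illustration only, and the percentages are the ones quoted above.

```python
import math


def entropy_reduction_bits(relative_perplexity_reduction):
    """Bits per word saved when perplexity drops by the given fraction.

    Perplexity = 2 ** H, so PP_new / PP_old = 1 - r implies
    H_old - H_new = -log2(1 - r).
    """
    return -math.log2(1.0 - relative_perplexity_reduction)


for r in (0.127, 0.043, 0.08, 0.053):
    bits = entropy_reduction_bits(r)
    print(f"{r:.1%} lower perplexity = {bits:.3f} bits per word")
```

On this scale the reductions reported here amount to a fraction of a bit per word, both on training data and on test data.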

These  perplexity  results  are  consistent  with  the  observation  that  for  a  fixed  word-pair 
grammar,  as  the  training  corpus  grows  in  size  the  long-range  trigram  model  becomes  a 
small perturbation of the standard trigram model. This is because the number of disjunct
parameters $D(d \mid w_1 w_2)$ and long-range trigram parameters $L(w \mid w_1 w_2)$ is on
the order of the number of bigrams, which becomes negligible compared to the number of
trigram parameters as the training set grows in size.

The smoothed models were incorporated into the Candide system for machine translation
[BBP+94]. When compared with translations obtained with the system using the
standard trigram model, our long-range model showed a slight advantage overall. For
example, the French sentence “Manille a manqué d’électricité pendant dix heures mercredi,”
which was translated as “Manila has run out of electricity for ten hours Wednesday” using
the standard language model, was translated as “Manila lacked electricity for ten hours
Wednesday” using the link grammar model.

While the long-range trigram model that we have described in this paper represents
only a small change in the trigram model itself, we believe that the techniques we develop
here  demonstrate  the  viability  of  more  complex  link  grammar  models,  and  show  that 
significant  improvements  can  be  obtained  using  this  approach. 




References 

[BBdSM91] L.R. Bahl, P.F. Brown, P.V. de Souza, and R.L. Mercer. Tree-based smoothing
algorithm for a trigram language speech recognition model. IBM Technical
Disclosure Bulletin, 34(7B):380-383, December 1991.

[BBP+94] A. Berger, P. Brown, S. Della Pietra, V. Della Pietra, J. Gillett, J. Lafferty,
R. Mercer, H. Printz, and L. Ures. The Candide system for machine
translation. In Human Language Technologies. Morgan Kaufmann Publishers,
1994.

[BT73] T. Booth and R. Thompson. Applying probability measures to abstract
languages. IEEE Transactions on Computers, C-22:442-450, 1973.

[LST92] J. Lafferty, D. Sleator, and D. Temperley. Grammatical trigrams: A probabilistic
model of link grammar. In Proceedings of the AAAI Fall Symposium
on Probabilistic Approaches to Natural Language, Cambridge, MA, 1992.

[ST91] D. Sleator and D. Temperley. Parsing English with a link grammar. Technical
Report CMU-CS-91-196, School of Computer Science, Carnegie Mellon
University, 1991.




Appendix  A:  Enumerating  Linkages 

In  this  appendix  we  derive  a  formula  for  the  number  of  linkages  of  the  model  described 
in  Section  1  when  the  grammar  allows  long-range  connections  between  any  pair  of  words. 

There  is  a  natural  correspondence  between  the  linkages  of  model  (1)  and  trees  where 
each  node  has  either  zero,  one,  or  two  children.  A  node  having  one  child  will  be  called 
unary and a node having two children will be called binary. Let $a_{m,n}$ be the number of
trees having m unary nodes and n binary nodes. Then, with $a_{0,0} = 1$, $a_{m,n}$
satisfies the recurrence

$$a_{m,n} = a_{m-1,n} + \sum_{0 \le k \le m} \; \sum_{0 \le l \le n-1} a_{k,l}\, a_{m-k,\,n-l-1} .$$

Thus, the generating function $T(x,y) = \sum_{m,n \ge 0} a_{m,n} x^m y^n$ satisfies the
equation

$$T(x,y) = 1 + x\,T(x,y) + y\,T^2(x,y) .$$

Since $T(0,0) = 1$ we have that

$$T(x,y) = \frac{1 - x - \sqrt{(1-x)^2 - 4y}}{2y} .$$


The total number of nodes in a tree that has m unary nodes and n binary nodes is
2n + m + 1. Therefore, if $S(z) = \sum_{k \ge 0} s_k z^k$ is the generating function given
by $S(z) = z\,T(z, z^2)$, then

$$s_k = \sum_{2n+m+1=k} a_{m,n}$$

and $s_k$ is the number of trees having a total of k nodes. S is given by


$$S(z) = \frac{1 - z - \sqrt{(1-z)^2 - 4z^2}}{2z} = \frac{1 - z - \sqrt{1 - 2z - 3z^2}}{2z} .$$

While we are unable to find a closed form expression for the coefficients

$$s_k = -\frac{1}{2} \sum_{0 \le i \le k+1} (-3)^i \binom{1/2}{i} \binom{1/2}{k+1-i}, \qquad k \ge 1,$$

a few of the values are displayed below.


k     1  2  3  4  5   6   7    8    9   10          20             25
s_k   1  1  2  4  9  21  51  127  323  835  18,199,284  3,192,727,797
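
The recurrence for the number of trees is easy to check numerically. The following sketch, whose function names `a` and `s` are chosen here for illustration, evaluates the recurrence with memoization and sums over 2n + m + 1 = k; under these assumptions it reproduces the tabulated values.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def a(m, n):
    """Number of trees with m unary nodes and n binary nodes."""
    if m < 0 or n < 0:
        return 0
    if m == 0 and n == 0:
        return 1  # the tree consisting of a single leaf
    total = a(m - 1, n)  # the root is unary
    # The root is binary: distribute the remaining nodes over two subtrees.
    for k in range(m + 1):
        for l in range(n):
            total += a(k, l) * a(m - k, n - l - 1)
    return total


def s(k):
    """Number of trees with k nodes in total (sum over 2n + m + 1 = k)."""
    return sum(a(k - 1 - 2 * n, n) for n in range((k - 1) // 2 + 1))


print([s(k) for k in range(1, 11)], s(20), s(25))
```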
Appendix B: Deficiency


We  say  that  a  language  model  is  deficient  if  it  assigns  a  probability  that  is  smaller  than 
one  to  the  set  of  strings  it  is  designed  to  model.  There  are  several  ways  in  which  a 
probabilistic  link  grammar  can  be  deficient.  One  such  way  is  if  the  total  probability  of 
finite  linkages  is  smaller  than  one.  In  this  appendix  we  derive  conditions  under  which  this 
type  of  deficiency  can  occur  for  a  simplified  version  of  our  model.  The  general  analysis  is 
similar,  but  more  intricate  [BT73]. 

Following the notation of Appendix A, suppose that we generate trees probabilistically
with a node having zero children with probability $p_0$, one child with probability $p_1$,
and two children with probability $p_2$, irrespective of the label of the node. These
probabilities correspond to the disjunct probabilities $D(\mathrm{HALT} \mid s)$,
$D(\mathrm{STEP} \mid s)$, and $D(\mathrm{BRANCH} \mid s)$. We ignore the short and
long-range trigram probabilities in this simplified model. The probability of generating a
given tree with m unary nodes and n binary nodes is then $p_0^{n+1} p_1^m p_2^n$.
The total probability assigned to finite trees is

$$T_{\mathrm{finite}} = \sum_{m,n \ge 0} a_{m,n}\, p_0^{n+1} p_1^m p_2^n = p_0\, T(p_1, p_0 p_2) .$$

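Using the calculations of Appendix A, this leads directly to the relation

$$T_{\mathrm{finite}} = \frac{1 - p_1 - \sqrt{(1-p_1)^2 - 4 p_0 p_2}}{2 p_2} = \frac{p_0 + p_2 - |p_0 - p_2|}{2 p_2} ,$$

where the second equality uses $1 - p_1 = p_0 + p_2$.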


In terms of the expected number of children $E[n] = p_1 + 2p_2$, we can state this as

$$T_{\mathrm{finite}} = \begin{cases} 1 & E[n] \le 1 \\ p_0/p_2 & E[n] > 1 . \end{cases}$$


More generally, for n-ary trees with probability $p_i$ of generating i children, with
$0 \le i \le n$, $T_{\mathrm{finite}}$ is the smallest non-negative root of the equation

$$T = \sum_{0 \le i \le n} p_i T^i ,$$

and $T_{\mathrm{finite}} = 1$ in case $E[n] \le 1$. This is a well-known result in the
theory of branching processes.
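
The branching-process result can be checked numerically with a short sketch. Iterating the map that sends T to the sum of $p_i T^i$, starting from T = 0, converges to the smallest non-negative root; the function names and the example probabilities below are assumptions chosen for illustration.

```python
def finite_tree_probability(p, iterations=10000):
    """Total probability of finite trees when a node has i children with
    probability p[i], found by fixed-point iteration from T = 0."""
    t = 0.0
    for _ in range(iterations):
        t = sum(p_i * t ** i for i, p_i in enumerate(p))
    return t


def binary_closed_form(p0, p2):
    """Closed form for at most two children: 1 if p0 >= p2, else p0 / p2.

    Assumes p2 > 0.
    """
    return min(1.0, p0 / p2)


subcritical = (0.5, 0.2, 0.3)     # expected number of children 0.8
supercritical = (0.2, 0.2, 0.6)   # expected number of children 1.4

for p in (subcritical, supercritical):
    print(p, finite_tree_probability(p), binary_closed_form(p[0], p[2]))
```

In the second example the model is deficient: a positive fraction of the probability mass is assigned to trees that never terminate.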